Target of selective auditory attention can be robustly followed with MEG

Selective auditory attention enables filtering of relevant acoustic information from irrelevant. Specific auditory responses, measurable by magneto- and electroencephalography (MEG/EEG), are known to be modulated by attention to the evoking stimuli. However, such attention effects have typically been studied in unnatural conditions (e.g. during dichotic listening of pure tones) and have been demonstrated mostly in averaged auditory evoked responses. To test how reliably we can detect the attention target from unaveraged brain responses, we recorded MEG data from 15 healthy subjects that were presented with two human speakers uttering continuously the words “Yes” and “No” in an interleaved manner. The subjects were asked to attend to one speaker. To investigate which temporal and spatial aspects of the responses carry the most information about the target of auditory attention, we performed spatially and temporally resolved classification of the unaveraged MEG responses using a support vector machine. Sensor-level decoding of the responses to attended vs. unattended words resulted in a mean accuracy of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$79\% \pm 2 \%$$\end{document}79%±2% (N = 14) for both stimulus words. The discriminating information was mostly available 200–400 ms after the stimulus onset. Spatially-resolved source-level decoding indicated that the most informative sources were in the auditory cortices, in both the left and right hemisphere. Our result corroborates attention modulation of auditory evoked responses and shows that such modulations are detectable in unaveraged MEG responses at high accuracy, which could be exploited e.g. in an intuitive brain–computer interface.

www.nature.com/scientificreports/ In this study, we propose a novel paradigm for eventual BCI applications that differs from the conventional cocktail party problem by employing simple, minimally overlapping word stimuli in two rapid sequences, thereby enabling fast tracking of the target of attention. By embedding sequence deviants, we can also include a task that allows behavioural quantification of the deployment of attention. Our aim was to provide a paradigm that could be efficiently used in a simple yet intuitive brain-computer interface.
To this end, we created an acoustically realistic scene with two concurrent auditory stimulus streams. Stimuli comprised of two human speakers uttering the words "Yes" and "No" in an interleaved manner at − 40 and + 40 degrees from the line forward from the subject, mimicking a real-life situation where two persons are speaking simultaneously on the sides of the subject. In each stream, the pitch of the word alternated (standard) but this implicit rule was occasionally broken by presenting two same-pitch versions of the stimulus word in succession (deviant). We measured MEG in 15 subjects while they were presented with these stimuli and were asked to covertly count these deviants in the attended stream and report the number at the end of each measurement block.

Results
Behavioral data. On average, the subjects reported 40 ± 17.6% (deviant probability 10%, N = 5 ) and 97 ± 0% (deviant probability 5%, N = 6 ) of the deviants in the stream they were instructed to attend to. Three subjects were not included in this analysis due to technical problems in collecting their deviant counts.
Sensor-level analysis. Time-resolved decoding was performed on the unaveraged epochs comprising all channels at each time point. At the group level, decoding "Attended No" vs. "Unattended No" and "Attended Yes" vs. "Unattended Yes" both showed peaks around 160 ms (Fig. 1a).
Spatially-resolved decoding indicated that the most informative signals arose from temporal regions; the patterns of decoding accuracy were qualitatively similar across the subjects; see Fig. 1b for a representative subject and for the group result.
To further characterize the stimulus-related information present in the auditory evoked responses, we decoded also for the stimulus word (not for attention). The mean accuracy was 88% ± 2% (range 75-96%) for the 1-s epochs and 93% ± 1% (range 82-99%) for the 2-s epochs when decoding "Attended Yes" vs. "Attended No". Similarly, when decoding for "Unattended Yes" vs. "Unattended No", we obtained an average decoding accuracy of 88% ± 2% (range 76-95%) for the 1-s epochs and 92% ± 2% (range 77-98%) for the 2-s epochs. Thus, the accuracy of word-wise decoding did not depend significantly on whether the words were attended or not ( p = 0.80 for the short epochs and p = 0.53 for the long epochs). Compared to attention decoding, this wordwise decoding gave statistically significantly higher accuracy for both the short ( p < 0.005 for all four possible comparisons) and long ( p < 0.005 ) epochs.
The average evoked responses to each attention condition ("Attended Yes", "Unattended Yes", "Attended No", "Unattended No") for a single subject and for the group can be found in Supplementary Fig. S1. In that figure, each condition represents pooled responses to the low-and high-pitch stimuli. These average evoked responses were computed only for the sensor-and source-level visualizations, and all decoding was performed on unaveraged (single-trial) responses.
The group-averaged evoked responses peaked at 250 ms (at channel 'MEG 1322') after the stimulus onset for the "Attended Yes" and at 136 ms ('MEG 1322') for the "Unattended Yes" condition. For the condition "Attended No", the responses peaked at 340 ms ('MEG 0242') and for "Unattended No" at 350 ms ('MEG 0212'). The planar gradient strength maps ( Supplementary Fig. S1) are compatible with sources in auditory cortices.
Source-level analysis. The group-level source estimates depicted in the Fig. 3 show the responses to attended and unattended word stimuli at three different latencies. In the right hemisphere, the activation peaked at 270 ms after the onset of the attended "Yes". Activation to the attended "No" peaked at 330 ms in the left hemisphere after the stimulus onset. The interindividual variation in the response latencies and amplitudes was considerably higher in the left vs. right hemisphere, which led to smearing of the group-average source dynamics in the left hemisphere (Fig. 3).
Spatial-searchlight decoding (Fig. 4) revealed that the source signals giving the highest decoding accuracy arose from the auditory cortices but significantly also from sensorimotor cortex that has been associated with auditory stimuli processing as well 36 . However, the spatial peaks of accuracy, as show in Fig. 4, did not align well in time and space across subjects, which led to low group-average accuracy for any single location on the cortex.

Discussion
In this study, we recorded brain signals while presenting simple, minimally overlapping spoken-word stimuli and demonstrated that the target of selective auditory attention to these concurrent streams can robustly (accuracy on average 79% and up to 91% in the best-performing subject) be decoded from just 1 s of MEG data. The decoding Classification mean accuracy, %

Yes
No ±SD Subject ID S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S15 www.nature.com/scientificreports/ accuracy peaked around 160 ms after the onset of both stimulus words and remained above chance level for several hundred milliseconds. The highest accuracy was obtained with signals arising from the auditory cortices. Previous studies have shown that non-semantic acoustic properties, such as sound-source-specific pitch 37 , are crucial for solving the cocktail party problem at the perceptual level 38,39 . Although our paradigm did not strictly adhere to the conventional cocktail-party definition as our stimulus words overlapped only minimally, the cortical representation of these properties and the MEG signals evoked by them may still contribute to our ability to decode the target of attention. The presence of pitch-related information in our data was demonstrated by the above-chance-level accuracy when decoding for the pitch (low vs. high pitch) instead of attention. Similarly, we obtained a high accuracy when decoding for the stimulus word (group mean of 88% for the 1-s epochs) or for the stimulus pitch (range 70-72%) instead of attention, which also speaks for the presence acoustic information in the MEG responses; differing stimulus durations (the two words and their pitch variants are of slightly different lengths) and the corresponding variation in the evoked responses is probably the most important feature that the decoder uses to achieve this higher accuracy.
Our behavioural results showed that by doubling the number of deviants in the stimulus sequence (10% instead of 5%), their detection rate dropped drastically, suggesting that the higher deviant rate was too demanding a task. The load theory of attention 40 S10 S15 Figure 4. Auditory cortical regions generate the signals most informative of the attended stimulus stream. The colour gradient (yellow highest) represents the source-space spatial searchlight decoding result averaged across the subject group (N = 11). Each color dot represents the accuracy peak in one subject. www.nature.com/scientificreports/ attention demand, the task performance drops, which is in line with our data. The current view of load theory 41 is that the load on working memory and cognitive control processes would hamper target detection, whereas load to visual short-term memory would do the opposite, that is, would reduce detecting distractors. However, the debate is still ongoing (for a review, see 42 ).
Earlier studies have demonstrated that rich naturalistic stimuli, compared to monotonous tones, not only improve the users' ergonomic evaluation of the situation but also yield higher decoding accuracy 21,43 . Further, it has been shown that subjects perform better on selective attention tasks when presented with naturalistic speech in comparison to other kind of naturalistic stimuli 44 . Thus, the naturalistic, spoken-word stimuli that we used have likely contributed to the high classification accuracy.
In our study, the pitch difference between the spoken-word streams due to speaker gender (male and female voices) likely helped focusing attention to one speaker. Yet, the pitch itself did not seem to play a role in either stream alone as the attention decoding accuracy for each word-stream was very similar (Fig. 2).
Stimulus timing likely has an effect on decoding accuracy as it influences the amplitude and latency of the attention-modulated evoked responses. Stimulus onset asynchrony (SOA) has been shown to affect accuracy in decoding attention to simple tones by Höhne et al. 45 ; they found that SOA of 1000 ms gave the best decoding accuracy but the highest information transfer rate was achieved with short SOA's (87-175 ms). Other studies that used virtual sound stimuli observed that SOA of 400-600 ms provided the best decoding accuracy 28,46 . Given those previous studies, our SOA of 1000 ms was likely optimal in terms of decoding accuracy but probably would not have yielded the highest information transfer rate, if our paradigm was applied in a brain-computer interface (BCI).
Typically, using a longer span of data for decoding improves accuracy if all data are informative; for example, Maÿe and colleagues have demonstrated this in the BCI context 47 . Also our results showed that using the long (2-s) instead of the short (1-s) epoch increased the accuracy of decoding the target of attention in all 14 subjects. Again, for a BCI, the long epochs may not be the optimal choice to maximize the information transfer rate.
Spatial searchlight decoding across the cortex yielded accuracy peaks at locations similar to those of the largest differences in the source estimates of the evoked responses to the two attention conditions. This agreement of the two analysis methods further supports the notion that the selective auditory attention-modulated cortical activity is mostly in the primary auditory cortex 48 . Regardless of the roughly similar cortical location of the most attention-informative source in each subject, these locations did not fully overlap, which led to a dispersed group average even though interindividual variation of cortical anatomy was reduced by surface-based morphing of the individual brains to an average brain. This variation-although minor-in the location and orientation of the source providing the highest decoding accuracy likely means that classifiers do not generalize well across subjects but that the classifier should be trained separately for each subject if one is aiming to the highest accuracy.
Left and right hemispheres are differently specialised to process auditory stimuli. Language-specific areas are typically lateralised to the left hemisphere 49 . For instance, left hemisphere has been found to respond more than the right to the temporal aspects of auditory stimuli 50 . For comparison, right hemisphere has been found to be more involved in spectral processing of e.g. tones and music 51 . Previous studies found that right hemisphere responds to the manipulation of pitch in human speech 50,52,53 .
Based these previous findings and our data, we suggest that selective attention is engaged to follow the regular pitch alternation and thus to support the detection of its infrequent deviants in the indicated stimulus sequence. In our data, such an engagement was manifested in the responses in the right hemisphere (attended vs. unattended stimuli) at around 270 ms after the stimulus onset. The left hemisphere was activated later (at around 330 ms) and had higher activity for the attended vs. unattended stimuli while such a difference was not as clear in the right hemisphere. The peak of left-hemisphere source could potentially be related to the processing of lexical/semantic information as, e.g., for visual word stimuli 54 .
It is conceivable that in our experiment participants applied different strategies of keeping their attention to one auditory stream or they might have even changed their strategy in the course of the experiment. This possibility could be studied by training the decoder by samples from specific parts of the recording (e.g. only from the beginning) and comparing the obtained classification accuracy. In addition, the influence of stimulation rate to selective attention and its decoding from brain signals could be tested. Moreover, future studies could assess individual differences in response latency and spatial patterns on the MEG sensor array that may limit across-subject generalization.
Using the current experimental paradigm, one could test how robustly the observed attention modulation of brain responses could be detected by EEG instead of MEG. Spatial separability of cortical sources is typically poorer in EEG compared to MEG 55,56 and thus the accuracy of decoding the attended word stream from EEG would likely be lower; yet, the accuracy could remain at a level which enables a portable and intuitive brain-computer interface.

Conclusions
We showed that the attended spoken-word stream can reliably be decoded from just one-second epochs of unaveraged MEG data. The achieved high decoding accuracy shall enable future investigations on the neural mechanisms of attentional selection and it may also be exploited in a MEG-or EEG-based streaming braincomputer interface.

Materials and methods
Participants. Fifteen healthy adult volunteers (4 females, 11 males; mean age 28.8±3.8 years, range 23-38 years) participated in our study. Two subjects were left-handed and the rest right-handed. Participants did not report hearing problems or history of psychiatric disorders. The study was approved by the Aalto University www.nature.com/scientificreports/ Research Ethics Committee. The research was carried out in accordance with the guidelines of the Declaration of Helsinki, and the subjects gave written informed consent prior the measurements.
Stimuli and experimental protocol. The subjects were presented with two auditory streams, one comprising the spoken word "Yes" and the other the word "No". The words alternated such that the words did not overlap. In each stream ("Yes" and "No"), the stimulus onset asynchrony (SOA) was 1000 ms, and the duration of the stimulus words were 450-550 ms depending on the word and its pitch variant. Fig. 5 illustrates the stimulus timing (Fig. 5a), positioning in the acoustic scene (Fig. 5b) and the structure of the stimulus sequence Fig. 5c. To create a realistic acoustic scene, the stimuli were recorded with a dummy head (Mk II, Cortex Instruments GmbH, Germany) at the center of a room with dimensions comparable to those of the magnetically shielded room where the MEG recordings were performed later. The speakers were standing at about at − 40 and + 40 degrees from the front-line of the dummy head at a distance of 1.13 m. The word "Yes" was uttered by a female and the word "No" by a male speaker. Thus, in the experiment, the sound of each speaker was presented to both ears of the subject as recorded with the dummy head; see Fig. 5a.
Subjects were asked to attend to one of the two asynchronously alternating spoken word streams at a time by a visual cue ("LEFT-YES" or "RIGHT-NO") shown next to the fixation cross and accordingly on left or right side of the screen. Each stream had two alternating pitches of the spoken word (denoted as [..., yes, YES, yes, YES, ...] and as [..., no, NO, no, NO, ...]). The original voice recordings were used as the low-pitch stimuli, and the pitch was increased by 13% and 15% for the high-pitch versions of "Yes" and "No", respectively. In each spoken-word stream, occasional violations (deviants) of otherwise regular pitch alternation (standards) occurred. The subjects were instructed to count the deviants in order to keep their attention to the indicated spoken word stream stimuli. The stimulus sequence always started with the low-pitch word; see Fig. 5c.
Deviants were presented with the probability of 10% in both streams for the first seven subjects and with probability of 5% for the rest of the subjects. The deviant frequency was decreased based on subject feedback to reduce the mental load of memorizing the deviant count.
The experiment comprised 8 blocks, each lasting about 135 s. Two seconds before a block started, the subject was instructed to direct his/her attention to one of the streams by the cues "LEFT-YES" or "RIGHT-NO" on the screen. The task of the subject was to focus on the indicated word stream, covertly count the deviants, maintain gaze at the fixation cross displayed on the screen and verbally report the count at the end of the block.
During the cue "LEFT-YES", the evoked responses to the word "Yes" were assigned to the condition "Attended Yes" and the evoked responses to "No" were assigned to the condition "Unattended No". Similarly, during the cue "RIGHT-NO", evoked responses to "No" were assigned to the condition "Attended No" and, accordingly, evoked responses to "Yes" were assigned to the condition "Unattended Yes".
The experiment always started with a block with the cue "LEFT-YES" and was followed by a block with the cue "RIGHT-NO". The order of the remaining six blocks was randomized across subjects. The first blocks were not randomised due to our main goal to use the first two blocks for training the classifier. The total length of the experiment was 50-60 min including the breaks between the blocks.
PsychoPy version 1.79.01 57,58 Python package was used for controlling and presenting the auditory stimuli and visual instructions. www.nature.com/scientificreports/ Electron Industry Co., Ltd, Taichung, Taiwan), and custom-built loudspeaker units outside of the shielded room and plastic tubes conveying the stimuli separately to the ears. Sound pressure was adjusted to a comfortable level for each subject individually. Due to timing inaccuracies in the stimulus presentation system, the delay from the trigger to sound onset for the "Yes" stimuli varied with a standard deviation of 7 ms while that for "No" varied with a standard deviation of 11 ms.
MEG data acquisition. MEG measurements were performed with a whole-scalp 306-channel Elekta-Neuromag VectorView MEG system (MEGIN Oy, Helsinki, Finland) at the MEG Core of Aalto Neuroimaging, Aalto University. During acquisition, the data were filtered to 0.1-330 Hz and sampled at 1 kHz. Prior to the MEG recording, anatomical landmarks (nasion, left and right preauricular points), head-position indicator coils, and additional scalp-surface points (around 100) were digitized using an Isotrak 3D digitizer (Polhemus Navigational Sciences, Colchester, VT, USA). Bipolar electrooculogram (EOG) with electrodes positioned around the right eye (laterally and below) was recorded. Fourteen of the 15 subjects were recorded with continuous head movement tracking. All subjects were measured in the seated position. The back-projection screen for delivering the visual instructions was 1 m from the eyes of the subject. If needed, vision was corrected by nonmagnetic goggles.
The MEG recording of one subject had technical problems and this dataset had to be dropped from the analysis.
Data pre-processing. The MaxFilter software (version 2.2.10; MEGIN Oy, Helsinki, Finland) was applied to all MEG data (magnetometers and planar gradiometers) to suppress external interference using temporal signal space separation and to compensate for head movements 59  Infinite-impulse-response filters (4th-order Butterworth, applied both forward and backward in time) were employed to filter the unaveraged MEG data to 0.1-30 Hz for visualization of the evoked responses and for sensor-and source-level decoding. Ocular artifacts were suppressed by removing those independent components (1-4 per subject, on average 3) that correlated most with the EOG signal.
For the subsequent data analysis, only planar gradiometers were used due to the straightforward interpretation of their spatial pattern; they show the maximum signal right above the active source.
Epochs with two different pre-(200 ms and 500 ms) and post-stimulus (800 ms and 1500 ms) periods were extracted from the MEG data at every word stimulus. Epochs were rejected if any of the planar gradiometer signals exceeded 4000 fT/cm. Deviant epochs were excluded from data analysis. The trial counts were equalized, and the responses averaged across each condition ("Attended Yes", "Attended No", "Unattended Yes" and "Unattended No") for visualization and source estimation.
Source estimation. Head models were constructed based on individual magnetic resonance images (MRIs) by applying the watershed algorithm implemented in the FreeSurfer software 63-65 , version 5.3. Using the MNE software, single-compartment boundary element models (BEM) comprising 5120 triangles were then created based on the inner skull surface. In addition to the one subject with technical problems in MEG recording, the MRIs of three subjects were not available, leaving 11 subjects for the source estimation.
For the source space, the cortical mantle was segmented from MRIs using FreeSurfer and the resulting triangle mesh was subdivided to 4098 sources per hemisphere. The dynamic statistical parametric mapping 66 , dSPM; variant of minimum-norm estimation was applied to model the activity at these sources. The noise covariance was estimated from the 2-min resting-state measurement of each subject. These data were pre-processed similarly as the task-related data.
The source amplitudes for the attention conditions "Attended Yes", "Unattended Yes","Attended No" and "Unattended No" were estimated for all subjects individually. For the group-level source estimate, the obtained source amplitudes were first normalized such that the absolute peak value of the attended condition became one, the estimates were morphed to the FreeSurfer average brain and then averaged across subjects. The morphing procedure from individual brains to the average brain is described by Greve et al. 67 . Decoding. Sensor-level decoding. A linear support vector machine 68 , SVM; classifier implemented in the Scikit-learn package 62 was applied to unaveraged epochs to decode the conditions "Attended Yes" vs. "Unattended Yes" and "Attended No" vs. "Unattended No". For comparison, decoding was also performed stimulusword-wise, i.e. "Attended Yes" + "Unattended Yes" vs. "Attended No" + "Unattended No". In addition, pitch-wise and single pitch variant attention-wise decoding were performed.
The pre-processed MEG data (filtered to 0.1-30 Hz) were down-sampled by a factor of 8 to a sampling rate of 125 Hz to reduce the number of features while preserving sufficient temporal information. Amplitudes of the planar gradiometer channels were concatenated to form the feature vector. Shuffled five-fold cross-validation (CV) was applied with an 80/20 split; 80% of data were used for training and the rest for testing. The empirical chance level was around 55% for our sample size of 500 epochs in this two-class decoding task 69 . We also verified the empirical chance level by the method by Ojala and Garriga 70 and in our case it was 53%.
Decoding was separately performed on data of (1)  www.nature.com/scientificreports/ Source-level decoding. A linear SVM decoder with five-fold cross-validation (80%/20% split for training/testing) was applied to the individual source estimates for the attention conditions "Attended Yes" vs. "Unattended Yes" and "Attended No" vs. "Unattended No" calculated from all MEG planar gradiometer channels. A spatial searchlight decoding across the source space was used on the 1-s (− 200 to 800 ms after stimulus onset) epochs, and the resulting accuracy maps were morphed to the FreeSurfer average brain and averaged across the subjects (N = 11). The accuracy maps for attention conditions "Attended Yes" vs. "Unattended Yes" and "Attended No" vs. "Unattended No" were then averaged to obtain a general accuracy map.

Data availability
The datasets generated and analysed during the current study are not publicly available due to the local legislation on research on humans but are available from the corresponding author on reasonable request. www.nature.com/scientificreports/